{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyM3/AsVIbC5YW5sKPMqDARh",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/tomasonjo/blogs/blob/master/llm/llm_graph_transformer_in_depth.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "zK-Jv5o5qPgm"
      },
      "outputs": [],
      "source": [
        "!pip install --quiet neo4j langchain-community langchain-experimental langchain-openai json-repair"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Building knowledge graphs with LLM Graph Transformer\n",
        "## A deep dive into LangChain's implementation of graph construction with LLMs\n",
        "\n",
        "Creating graphs from text is incredibly exciting, but definitely challenging. Essentially, it's about converting unstructured text into structured data. While this approach has been around for some time, it gained significant traction with the advent of Large Language Models (LLMs), bringing it more into the mainstream.\n",
        "\n",
        "![image](https://cdn-images-1.medium.com/max/1600/0*qgT2hBiA3DA1Y3qu.png)\n",
        "\n",
        "In the image above, you can see how information extraction transforms raw text into a knowledge graph. On the left, multiple documents show unstructured sentences about individuals and their relationships with companies. On the right, this same information is represented as a graph of entities and their connections, showing who worked at or founded various organizations.\n",
        "But why would you want to extract structured information from text and represent it as a graph? One key reason is to power retrieval-augmented generation (RAG) applications. While using text embedding models over unstructured text is an useful approach, it can fall short when it comes to answering complex, multi-hop questions that require understanding connections across multiple entities or question where structured operations like filtering, sorting, and aggregation is required. By extracting structured information from text and constructing knowledge graphs, you not only organize data more effectively but also create a powerful framework for understanding complex relationships between entities. This structured approach makes it much easier to retrieve and leverage specific information, expanding the types of questions you can answer while providing greater accuracy.\n",
        "\n",
        "Around a year ago, I began experimenting with building graphs using LLMs, and due to the growing interest, we decided to integrate this capability into LangChain as the LLM Graph Transformer. Over the past year, we've gained valuable insights and introduced new features, which we'll be showcasing in this blog post.\n",
        "\n",
        "\n",
        "## Setting up Neo4j environment\n",
        "\n",
        "We will use Neo4j as the underlying graph store, which comes with out-of-the box graph visualizations. The easiest way to get started is to use a free instance of Neo4j Aura, which offers cloud instances of the Neo4j database. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance."
      ],
      "metadata": {
        "id": "MqGypB2mQvhX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain_community.graphs import Neo4jGraph\n",
        "\n",
        "graph = Neo4jGraph(\n",
        "    url=\"bolt://54.87.130.140:7687\",\n",
        "    username=\"neo4j\",\n",
        "    password=\"cables-anchors-directories\",\n",
        "    refresh_schema=False\n",
        ")\n",
        "\n",
        "def clean_graph():\n",
        "    query = \"\"\"\n",
        "    MATCH (n)\n",
        "    DETACH DELETE n\n",
        "    \"\"\"\n",
        "    graph.query(query)"
      ],
      "metadata": {
        "id": "tbUnI_F-s5rP"
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## LLM Graph Transformer\n",
        "The LLM Graph Transformer was designed to provide a flexible framework for building graphs using any LLM. With so many different providers and models available, this task is far from simple. Fortunately, LangChain steps in to handle much of the standardization process. As for the LLM Graph Transformer itself, it's like two cats stacked in a trench coat -with the ability to operate in two completely independent modes.\n",
        "\n",
        "![image](https://cdn-images-1.medium.com/max/2400/1*aCSCXuvrOB90jRQ0mNZtSA.png)\n",
        "\n",
        "The LLM Graph Transformer operates in two distinct modes, each designed to generate graphs from documents using an LLM in different scenarios.\n",
        "\n",
        "* **Tool-Based Mode (Default)**: When the LLM supports structured output or function calling, this mode leverages the LLM's built-in with_structured_outputto use tools. The tool specification defines the output format, ensuring that entities and relationships are extracted in a structured, predefined manner. This is depicted on the left side of the image, where code for the Node and Relationship classes is shown.\n",
        "* **Prompt-Based Mode (Fallback)** : In situations where the LLM doesn't support tools or function calls, the LLM Graph Transformer falls back to a purely prompt-driven approach. This mode uses few-shot prompting to define the output format, guiding the LLM to extract entities and relationships in a text-based manner. The results are then parsed through a custom function, which converts the LLM's output into a JSON format. This JSON is used to populate nodes and relationships, just as in the tool-based mode, but here the LLM is guided entirely by prompting rather than structured tools. This is shown on the right side of the image, where an example prompt and resulting JSON output are provided.\n",
        "\n",
        "These two modes ensure that the LLM Graph Transformer is adaptable to different LLMs, allowing it to build graphs either directly using tools or by parsing output from a text-based prompt.\n",
        "\n",
        "_Note that you can use prompt-based extraction even with models that support tools/functions by setting the attribute ignore_tools_usage=True._\n",
        "\n",
        "## Tool-based extraction\n",
        "We initially chose a tool-based approach for extraction since it minimized the need for extensive prompt engineering and custom parsing functions. In LangChain, the with_structured_output method allows you to extract information using tools or functions, with output defined either through a JSON structure or a Pydantic object. Personally, I find Pydantic objects clearer, so we opted for that.\n",
        "We start by defining a Node class.\n",
        "```\n",
        "class Node(BaseNode):\n",
        "    id: str = Field(..., description=\"Name or human-readable unique identifier\")\n",
        "    label: str = Field(..., description=f\"Available options are {enum_values}\")\n",
        "    properties: Optional[List[Property]]\n",
        "```\n",
        "Each node has an id, a label, and optional properties. For brevity, I haven't included full descriptions here. Describing ids as human-readable unique identifier is important since some LLMs tend to understand ID properties in more traditional way like random strings or incremental integers. Instead we want the name of entities to be used as id property. We also limit the available label types by simply listing them in the labeldescription. Additionally, LLMs like OpenAI's, support an enum parameter, which we also use.\n",
        "Next, we take a look at the Relationship class\n",
        "```\n",
        "class Relationship(BaseRelationship):\n",
        "    source_node_id: str\n",
        "    source_node_label: str = Field(..., description=f\"Available options are {enum_values}\")\n",
        "    target_node_id: str\n",
        "    target_node_label: str = Field(..., description=f\"Available options are {enum_values}\")\n",
        "    type: str = Field(..., description=f\"Available options are {enum_values}\")\n",
        "    properties: Optional[List[Property]]\n",
        "```\n",
        "This is the second iteration of the Relationship class. Initially, we used a nested Node object for the source and target nodes, but we quickly found that nested objects reduced the accuracy and quality of the extraction process. So, we decided to flatten the source and target nodes into separate fields-for example, source_node_id and source_node_label, along with target_node_id and target_node_label. Additionally, we define the allowed values in the descriptions for node labels and relationship types to ensure the LLMs adhere to the specified graph schema.\n",
        "The tool-based extraction approach enables us to define properties for both nodes and relationships. Below is the class we used to define them.\n",
        "class Property(BaseModel):\n",
        "    \"\"\"A single property consisting of key and value\"\"\"\n",
        "    key: str = Field(..., description=f\"Available options are {enum_values}\")\n",
        "    value: str\n",
        "Each Property is defined as a key-value pair. While this approach is flexible, it has its limitations. For instance, we can't provide a unique description for each property, nor can we specify certain properties as mandatory while others optional, so all properties are defined as optional. Additionally, properties aren't defined individually for each node or relationship type but are instead shared across all of them.\n",
        "We've also implemented a detailed system prompt to help guide the extraction. In my experience, though, the function and argument descriptions tend to have a greater impact than the system message.\n",
        "Unfortunately, at the moment, there is no simple way to customize function or argument descriptions in LLM Graph Transformer.\n",
        "## Prompt-based extraction\n",
        "Since only a few commercial LLMs and LLaMA 3 support native tools, we implemented a fallback for models without tool support. You can also set ignore_tool_usage=True to switch to a prompt-based approach even when using a model that supports tools.\n",
        "Most of the prompt engineering and examples for the prompt-based approach were contributed by Geraldus Wilsen.\n",
        "With the prompt-based approach, we have to define the output structure directly in the prompt. You can find the whole prompt here. In this blog post, we'll just do a high-level overview. We start by defining the system prompt.\n",
        "```\n",
        "You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph. Your task is to identify the entities and relations specified in the user prompt from a given text and produce the output in JSON format. This output should be a list of JSON objects, with each object containing the following keys:\n",
        "\n",
        "- **\"head\"**: The text of the extracted entity, which must match one of the types specified in the user prompt.\n",
        "- **\"head_type\"**: The type of the extracted head entity, selected from the specified list of types.\n",
        "- **\"relation\"**: The type of relation between the \"head\" and the \"tail,\" chosen from the list of allowed relations.\n",
        "- **\"tail\"**: The text of the entity representing the tail of the relation.\n",
        "- **\"tail_type\"**: The type of the tail entity, also selected from the provided list of types.\n",
        "\n",
        "Extract as many entities and relationships as possible.\n",
        "\n",
        "**Entity Consistency**: Ensure consistency in entity representation. If an entity, like \"John Doe,\" appears multiple times in the text under different names or pronouns (e.g., \"Joe,\" \"he\"), use the most complete identifier consistently. This consistency is essential for creating a coherent and easily understandable knowledge graph.\n",
        "\n",
        "**Important Notes**:\n",
        "- Do not add any extra explanations or text.\n",
        "```\n",
        "In the prompt-based approach, a key difference is that we ask the LLM to extract only relationships, not individual nodes. This means we won't have any isolated nodes, unlike with the tool-based approach. Additionally, because models lacking native tool support typically perform worse, we do not allow extraction any properties - whether for nodes or relationships, to keep the extraction output simpler.\n",
        "Next, we add a couple of few-shot examples to the model.\n",
        "```\n",
        "examples = [\n",
        "    {\n",
        "        \"text\": (\n",
        "            \"Adam is a software engineer in Microsoft since 2009, \"\n",
        "            \"and last year he got an award as the Best Talent\"\n",
        "        ),\n",
        "        \"head\": \"Adam\",\n",
        "        \"head_type\": \"Person\",\n",
        "        \"relation\": \"WORKS_FOR\",\n",
        "        \"tail\": \"Microsoft\",\n",
        "        \"tail_type\": \"Company\",\n",
        "    },\n",
        "    {\n",
        "        \"text\": (\n",
        "            \"Adam is a software engineer in Microsoft since 2009, \"\n",
        "            \"and last year he got an award as the Best Talent\"\n",
        "        ),\n",
        "        \"head\": \"Adam\",\n",
        "        \"head_type\": \"Person\",\n",
        "        \"relation\": \"HAS_AWARD\",\n",
        "        \"tail\": \"Best Talent\",\n",
        "        \"tail_type\": \"Award\",\n",
        "    },\n",
        "...\n",
        "]\n",
        "```\n",
        "In this approach, there's currently no support for adding custom few-shot examples or extra instructions. The only way to customize is by modifying the entire prompt through the promptattribute. Expanding customization options is something we're actively considering.\n",
        "Next, we'll take a look at defining the graph schema.\n",
        "## Defining the graph schema\n",
        "When using the LLM Graph Transformer for information extraction, defining a graph schema is essential for guiding the model to build meaningful and structured knowledge representations. A well-defined graph schema specifies the types of nodes and relationships to be extracted, along with any attributes associated with each. This schema serves as a blueprint, ensuring that the LLM consistently extracts relevant information in a way that aligns with the desired knowledge graph structure.\n",
        "\n",
        "In this blog post, we'll use the opening paragraph of Marie Curie's Wikipedia page for testing with an added sentence at the end about Robin Williams."
      ],
      "metadata": {
        "id": "nfDGWohxRKGu"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain_core.documents import Document\n",
        "\n",
        "text = \"\"\"\n",
        "Marie Curie, 7 November 1867 – 4 July 1934, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.\n",
        "She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.\n",
        "Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.\n",
        "She was, in 1906, the first woman to become a professor at the University of Paris.\n",
        "Also, Robin Williams!\n",
        "\"\"\"\n",
        "documents = [Document(page_content=text)]"
      ],
      "metadata": {
        "id": "ASyrLxerqUQc"
      },
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We'll also be using GPT-4o in all examples."
      ],
      "metadata": {
        "id": "LzuBI8gESEqZ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain_openai import ChatOpenAI\n",
        "import getpass\n",
        "import os\n",
        "\n",
        "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI api key\")\n",
        "\n",
        "llm = ChatOpenAI(model='gpt-4o')"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "g0hE3kk1q1t4",
        "outputId": "b15b3a6a-b080-46b7-d94c-794c6adc1305"
      },
      "execution_count": 4,
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "OpenAI api key··········\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "To start, let's examine how the extraction process works without defining any graph schema."
      ],
      "metadata": {
        "id": "45pjhe5qSI9y"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from langchain_experimental.graph_transformers import LLMGraphTransformer\n",
        "\n",
        "no_schema = LLMGraphTransformer(llm=llm)"
      ],
      "metadata": {
        "id": "6980jrt6rcA-"
      },
      "execution_count": 5,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now we can process the documents using the aconvert_to_graph_documents function, which is asynchronous. Using async with LLM extraction is recommended, as it allows for parallel processing of multiple documents. This approach can significantly reduce wait times and improve throughput, especially when dealing with multiple documents."
      ],
      "metadata": {
        "id": "8X2BPs8BSLJh"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "data = await no_schema.aconvert_to_graph_documents(documents)\n",
        "print(data)\n",
        "graph.add_graph_documents(data)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CCsfJvlfrshH",
        "outputId": "445f9508-11d5-466d-80cd-7dc2b95947d3"
      },
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[GraphDocument(nodes=[Node(id='Marie Curie', type='Person', properties={}), Node(id='Pierre Curie', type='Person', properties={}), Node(id='Nobel Prize', type='Award', properties={}), Node(id='University Of Paris', type='Organization', properties={}), Node(id='Radioactivity', type='Concept', properties={})], relationships=[Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Nobel Prize', type='Award', properties={}), type='WON', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Pierre Curie', type='Person', properties={}), type='SPOUSE', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='University Of Paris', type='Organization', properties={}), type='PROFESSOR', properties={}), Relationship(source=Node(id='Marie Curie', type='Person', properties={}), target=Node(id='Radioactivity', type='Concept', properties={}), type='RESEARCHED', properties={})], source=Document(metadata={}, page_content='\\nMarie Curie, 7 November 1867 – 4 July 1934, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.\\nShe was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.\\nHer husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.\\nShe was, in 1906, the first woman to become a professor at the University of Paris.\\nAlso, Robin Williams!\\n'))]\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The response from the LLM Graph Transformer will be a graph document, which has the above structure. The graph document describes extracted nodes and relationships . Additionally, the source document of the extraction is added under the source key."
      ],
      "metadata": {
        "id": "Sa1daHhgSPZb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "clean_graph()"
      ],
      "metadata": {
        "id": "qcZpEwBmtnqf"
      },
      "execution_count": 7,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now, let's try the same extraction using the prompt-based approach. For models that support tools, you can enable prompt-based extraction by setting the `ignore_tool_usage` parameter."
      ],
      "metadata": {
        "id": "fOsQUsWTScYP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "no_schema_prompt = LLMGraphTransformer(llm=llm, ignore_tool_usage=True)\n",
        "data = await no_schema_prompt.aconvert_to_graph_documents(documents)\n",
        "graph.add_graph_documents(data)"
      ],
      "metadata": {
        "id": "pSDKSDpmvYgt"
      },
      "execution_count": 8,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "clean_graph()"
      ],
      "metadata": {
        "id": "xyfW25s7xNPe"
      },
      "execution_count": 9,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Next, let's walk through how defining a graph schema can help produce more consistent outputs.\n",
        "\n",
        "## Defining allowed nodes\n",
        "Constraining the extracted graph structure can be highly beneficial, as it guides the model to focus on specific, relevant entities and relationships. By defining a clear schema, you improve consistency across extractions, making the outputs more predictable and aligned with the information you actually need. This reduces variability between runs and ensures that the extracted data follows a standardized structure, capturing expected information. With a well-defined schema, the model is less likely to overlook key details or introduce unexpected elements, resulting in cleaner, more usable graphs.\n",
        "We'll start by defining the expected types of nodes to extract using the allowed_nodesparameter."
      ],
      "metadata": {
        "id": "bgh8QMD5SkYq"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "allowed_nodes = [\"Person\", \"Organization\", \"Location\", \"Award\", \"ResearchField\"]\n",
        "nodes_defined = LLMGraphTransformer(llm=llm, allowed_nodes=allowed_nodes)\n",
        "data = await nodes_defined.aconvert_to_graph_documents(documents)\n",
        "graph.add_graph_documents(data)"
      ],
      "metadata": {
        "id": "qyBPVNenUuXy"
      },
      "execution_count": 10,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here, we defined that the LLM should extract five types of nodes like Person, Organization, Location, and more."
      ],
      "metadata": {
        "id": "Qb85FwUpSp9K"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "clean_graph()"
      ],
      "metadata": {
        "id": "nAx68NCWVAcQ"
      },
      "execution_count": 11,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "By specifying the expected node types, we achieve more consistent node extraction. However, some variation may still occur. For example, in the first run, \"radioactivity\" was extracted as a research field, while in the second, it was not.\n",
        "Since we haven't defined relationships, their types can also vary across runs. Additionally, some extractions may capture more information than others. For instance, the `MARRIED_TO` relationship between Marie and Pierre isn't present in both extractions.\n",
        "Now, let's explore how defining relationship types can further improve consistency.\n",
        "## Defining allowed relationships\n",
        "As we've observed, defining only node types still allows for variation in relationship extraction. To address this, let's explore how to define relationships as well. The first approach is to specify allowed relationships using a list of available types."
      ],
      "metadata": {
        "id": "QLieiqfBSwzT"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "allowed_nodes = [\"Person\", \"Organization\", \"Place\", \"Award\", \"ResearchField\"]\n",
        "allowed_relationships = [\"SPOUSE\", \"AWARD\", \"FIELD_OF_RESEARCH\", \"WORKS_AT\", \"IN_LOCATION\"]\n",
        "rels_defined = LLMGraphTransformer(\n",
        "  llm=llm,\n",
        "  allowed_nodes=allowed_nodes,\n",
        "  allowed_relationships=allowed_relationships\n",
        ")\n",
        "data = await rels_defined.aconvert_to_graph_documents(documents)\n",
        "graph.add_graph_documents(data)"
      ],
      "metadata": {
        "id": "s6EPH9bHJWuh"
      },
      "execution_count": 12,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "With both nodes and relationships defined, our outputs become significantly more consistent. For example, Marie is always shown as winning an award, being the spouse of Pierre, and working at the University of Paris. However, since relationships are specified as a general list without restrictions on which nodes they can connect, some variation still occurs. For instance, the `FIELD_OF_RESEARCH` relationship might appear between a `Person` and a `ResearchField`, but sometimes it links an `Award` to a `ResearchField`. Additionally, since relationship directions aren't defined, there may be differences in directional consistency.\n",
        "To address the issues of not being able to specify which nodes a relationship can connect and enforcing relationship direction, we recently introduced a new option for defining relationships, as shown below."
      ],
      "metadata": {
        "id": "VgjnAgoFS3-8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "clean_graph()"
      ],
      "metadata": {
        "id": "VBJKhP__hz3X"
      },
      "execution_count": 13,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Rather than defining relationships as a simple list of strings, we now use a three-element tuple format, where the elements represents the source node, relationship type, and target node, respectively."
      ],
      "metadata": {
        "id": "en0r7ZabTCzT"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "allowed_nodes = [\"Person\", \"Organization\", \"Location\", \"Award\", \"ResearchField\"]\n",
        "allowed_relationships = [\n",
        "    (\"Person\", \"SPOUSE\", \"Person\"),\n",
        "    (\"Person\", \"AWARD\", \"Award\"),\n",
        "    (\"Person\", \"WORKS_AT\", \"Organization\"),\n",
        "    (\"Organization\", \"IN_LOCATION\", \"Location\"),\n",
        "    (\"Person\", \"FIELD_OF_RESEARCH\", \"ResearchField\")\n",
        "]\n",
        "rels_defined = LLMGraphTransformer(\n",
        "  llm=llm,\n",
        "  allowed_nodes=allowed_nodes,\n",
        "  allowed_relationships=allowed_relationships\n",
        ")\n",
        "data = await rels_defined.aconvert_to_graph_documents(documents)"
      ],
      "metadata": {
        "id": "Up0RRInFJadP"
      },
      "execution_count": 14,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Using the three-tuple approach provides a much more consistent schema for the extracted graph across multiple executions. However, given the nature of LLMs, there may still be some variation in the level of detail extracted. For instance, on the right side, Pierre is shown as winning the Nobel Prize, while on the left, this information is missing.\n",
        "## Defining properties\n",
        "The final enhancement we can make to the graph schema is to define properties for nodes and relationships. Here, we have two options. The first is setting either `node_properties` or `relationship_properties` to `true` allows the LLM to autonomously decide which properties to extract."
      ],
      "metadata": {
        "id": "2EKiqgFTTI6i"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "allowed_nodes = [\"Person\", \"Organization\", \"Location\", \"Award\", \"ResearchField\"]\n",
        "allowed_relationships = [\n",
        "    (\"Person\", \"SPOUSE\", \"Person\"),\n",
        "    (\"Person\", \"AWARD\", \"Award\"),\n",
        "    (\"Person\", \"WORKS_AT\", \"Organization\"),\n",
        "    (\"Organization\", \"IN_LOCATION\", \"Location\"),\n",
        "    (\"Person\", \"FIELD_OF_RESEARCH\", \"ResearchField\")\n",
        "]\n",
        "node_properties=True\n",
        "relationship_properties=True\n",
        "props_defined = LLMGraphTransformer(\n",
        "  llm=llm,\n",
        "  allowed_nodes=allowed_nodes,\n",
        "  allowed_relationships=allowed_relationships,\n",
        "  node_properties=node_properties,\n",
        "  relationship_properties=relationship_properties\n",
        ")\n",
        "data = await props_defined.aconvert_to_graph_documents(documents)\n",
        "graph.add_graph_documents(data)"
      ],
      "metadata": {
        "id": "hy_UWOAPTIZu"
      },
      "execution_count": 15,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "clean_graph()"
      ],
      "metadata": {
        "id": "Roc1d-OpTVcc"
      },
      "execution_count": 16,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "We've enabled the LLM to add any node or relationship properties it considers relevant. For instance, it chose to include Marie Curie's birth and death dates, her role as a professor at the University of Paris, and the fact that she won the Nobel Prize twice. These additional properties significantly enrich the extracted information.\n",
        "\n",
        "The second option we have is to define the node and relationship properties we want to extract."
      ],
      "metadata": {
        "id": "Sf-z8hlgTY1r"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "allowed_nodes = [\"Person\", \"Organization\", \"Location\", \"Award\", \"ResearchField\"]\n",
        "allowed_relationships = [\n",
        "    (\"Person\", \"SPOUSE\", \"Person\"),\n",
        "    (\"Person\", \"AWARD\", \"Award\"),\n",
        "    (\"Person\", \"WORKS_AT\", \"Organization\"),\n",
        "    (\"Organization\", \"IN_LOCATION\", \"Location\"),\n",
        "    (\"Person\", \"FIELD_OF_RESEARCH\", \"ResearchField\")\n",
        "]\n",
        "node_properties=[\"birth_date\", \"death_date\"]\n",
        "relationship_properties=[\"start_date\"]\n",
        "props_defined = LLMGraphTransformer(\n",
        "  llm=llm,\n",
        "  allowed_nodes=allowed_nodes,\n",
        "  allowed_relationships=allowed_relationships,\n",
        "  node_properties=node_properties,\n",
        "  relationship_properties=relationship_properties\n",
        ")\n",
        "data = await props_defined.aconvert_to_graph_documents(documents)\n",
        "graph.add_graph_documents(data)"
      ],
      "metadata": {
        "id": "26TMe1qzTWQd"
      },
      "execution_count": 17,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "clean_graph()"
      ],
      "metadata": {
        "id": "p-kWOtTvTvhk"
      },
      "execution_count": 18,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The birth and death dates remain consistent with the previous extraction. However, this time, the LLM also extracted the start date of Marie's professorship at the University of Paris.\n",
        "Properties indeed add valuable depth to the extracted information, though there are currently some limitations in this implementation:\n",
        "Properties can only be extracted using the tool-based approach.\n",
        "All properties are extracted as strings.\n",
        "Properties can only be defined globally, not per node label or relationship type.\n",
        "There is no option to customize property descriptions to guide the LLM for more precise extraction.\n",
        "\n",
        "## Strict mode\n",
        "If you thought we had perfected a way to make the LLM follow the defined schema flawlessly, I have to set the record straight. While we invested considerable effort into prompt engineering, it's challenging to get LLM, especially the less performant one, to adhere to instructions with complete accuracy. To tackle this, we introduced a post-processing step, called strict_mode, that removes any information not conforming to the defined graph schema, ensuring cleaner and more consistent results.\n",
        "By default, `strict_mode` is set to `True`, but you can disable it with the following code:\n",
        "```\n",
        "LLMGraphTransformer(\n",
        "  llm=llm,\n",
        "  allowed_nodes=allowed_nodes,\n",
        "  allowed_relationships=allowed_relationships,\n",
        "  strict_mode=False\n",
        ")\n",
        "```\n",
        "With strict mode turned off, you may get node or relationship types outside the defined graph schema, as LLMs can sometimes take creative liberties with output structure.\n",
        "## Importing graph documents into graph database\n",
        "The extracted graph documents from the LLM Graph Transformer can be imported into graph databases like Neo4j for further analysis and applications using the add_graph_documents method. We'll explore different options for importing this data to suit different use cases.\n",
        "### Default import\n",
        "You can import nodes and relationships into Neo4j using the following code."
      ],
      "metadata": {
        "id": "nz9ZBoQ2TkDp"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "graph.add_graph_documents(data)"
      ],
      "metadata": {
        "id": "zUSHSKjVTsuI"
      },
      "execution_count": 19,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "This method straightforwardly imports all nodes and relationships from the provided graph documents. We've used this approach throughout the blog post to review the results of different LLM and schema configurations."
      ],
      "metadata": {
        "id": "pJaXbq3sTy6n"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "clean_graph()"
      ],
      "metadata": {
        "id": "5fRQCAKLTwSE"
      },
      "execution_count": 20,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Base entity label\n",
        "Most graph databases support indexes to optimize data import and retrieval. In Neo4j, indexes can only be set for specific node labels. Since we might not know all the node labels in advance, we can handle this by adding a secondary base label to each node using the `baseEntityLabel` parameter. This way, we can still leverage indexing for efficient importing and retrieval without needing an index for every possible node label in the graph."
      ],
      "metadata": {
        "id": "GlF3wsQ1T1Wm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "graph.add_graph_documents(data, baseEntityLabel=True)"
      ],
      "metadata": {
        "id": "PfErw3aAT8MR"
      },
      "execution_count": 21,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "As mentioned, using the baseEntityLabel parameter will result in each node having an additional __Entity__ label."
      ],
      "metadata": {
        "id": "ncQIir5TUDZb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "clean_graph()"
      ],
      "metadata": {
        "id": "qOvZ5hZDUEya"
      },
      "execution_count": 22,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Include source documents\n",
        "The final option is to also import the source documents for the extracted nodes and relationships. This approach lets us track which documents each entity appeared in. You can import the source documents using the include_source parameter."
      ],
      "metadata": {
        "id": "s3vL8C77UGqv"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "graph.add_graph_documents(data, include_source=True)"
      ],
      "metadata": {
        "id": "_rDkWziVUJNP"
      },
      "execution_count": 23,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "In this visualization, the source document is highlighted in blue, with all entities extracted from it connected by `MENTIONS` relationships. This mode allows you to build retrievers that utilize both structured and unstructured search approaches."
      ],
      "metadata": {
        "id": "eYqEAoerUNdo"
      }
    }
  ]
}